3DS : Cherry Blossom
Lyuda Bekwinknoll, Meghana Cyanam, Theresa Marie Duenas, Kevin Kiser
Complete the following lab as a group. This document should exist in your GitHub repo while you’re working on it. Your code should be heavily commented so someone reading your code can follow along easily. See the first code snippet below for an example of commented code.
Here’s the catch: For any given problem, the person writing the code should not be the person commenting that code, and every person must both code AND comment at least one problem in this lab (you decide how to split the work). This will involve lots of pushing and pulling through Git, and you may have to resolve conflicts if you’re not careful!
ALSO, all plots generated should have labeled axes, titles, and legends when appropriate. Don’t forget units of measurement! Make sure these plots could be interpreted by your client.
# load packages
library(dplyr)
library(ggplot2)
library(lubridate)
library(chron)
library(plotly)
library(purrr)
Cherry Blossom Race Plotting Problems
Question 1
- Looking at race times all on their own.
- Import the data from
CBdata.1_10.RDataand combine all 10 year’s worth of data into one data frame, removing observations with missing age or time data (this should be very similar to previous labs). For each year, create a histogram of race times and plot a the corresponding density curve in the same figure. Arrange all ten of these plots into one figure
- Import the data from
I am modifying cleaned_data from lab2
# load cherry blossom data
load("CBdata.1_10.RData")
# create one df and remove 'Pis/Tis' column and NAs from 'Age' and 'Time'
cleaned_data <- bind_rows(CBdata.1_10) %>%
filter(!is.na(Age)) %>%
filter(!is.na(Time) & Time != "") %>%
select(-'PiS/TiS')
# print out random observations to check
print(cleaned_data[c(137, 180, 5404), ])
## Race Name Age Time Pace Division PiD/TiD Hometown
## 137 1974 10M Joris Hogan (M) 28 1:03:97 6:28 M2529 NR NR
## 180 1974 10M John Gibbons (M) 23 1:07:91 6:51 M2024 NR NR
## 5404 1979 10M Erik Meyers (M) 29 1:02:49 6:17 M2529 99/438 Arlington, VA
## Year
## 137 1974
## 180 1974
## 5404 1979
# Remove any none-numeric and non-column characters from 'Time'
cleaned_data$Time <- gsub("[^0-9:]", "", cleaned_data$Time)
# Remove NA's from 'Time'
cleaned_data <- cleaned_data[!is.na(cleaned_data$Time), , drop = FALSE]
# Remove empty strings from 'Time'
cleaned_data <- cleaned_data[cleaned_data$Time != '', , drop = FALSE]
# Converts 'Time' column to a time format
cleaned_data$Time <- chron::times(cleaned_data$Time)
## Warning in convert.times(times., fmt): time-of-day entries out of range in
## positions 137,180,5404 set to NA
# Removes any potential NA's generated from the conversion above
cleaned_data <- cleaned_data[!is.na(cleaned_data$Time), , drop = FALSE]
# calculate the mean, min, max per 'Year' for 'Time'
summary_by_year <- cleaned_data %>%
group_by(Year) %>%
summarize(
mean_Time = mean(Time),
min_Time = min(Time),
max_Time = max(Time),
participants = n()
)
# print out summary generated above to have an idea of our data distribution
print(summary_by_year)
## # A tibble: 10 × 5
## Year mean_Time min_Time max_Time participants
## <int> <times> <times> <times> <int>
## 1 1973 00:58:12 00:55:14 01:01:11 2
## 2 1974 01:09:48 00:50:50 01:50:26 330
## 3 1975 01:09:58 00:51:47 01:39:45 239
## 4 1976 01:09:40 00:49:09 01:58:00 481
## 5 1977 01:09:38 00:49:44 02:44:16 720
## 6 1978 01:02:13 00:48:57 01:09:17 700
## 7 1979 01:17:00 00:48:00 02:10:30 2975
## 8 1980 01:13:21 00:47:08 02:05:57 1617
## 9 1981 01:16:32 00:47:17 02:08:47 3338
## 10 1982 01:16:20 00:49:29 02:01:39 3180
# Creates function that creates histogram and density plot pairs for dif years
# Takes in 2 arguments: 'plot_year' and 'cleaned_data'
plot_histogram_density <- function(plot_year, cleaned_data) {
# pulls out data from 'cleaned_data' based on 'plot_year'
subset_data <- cleaned_data[cleaned_data$Year == plot_year, ]
# makes histogram of data without plotting it
hist_data <- hist(subset_data$Time, breaks = 10, plot = FALSE)
# plots above generated histogram with color and labels
plot(hist_data, col = "lightblue",
main = paste("", plot_year),
xlab = "Time (hr:min)",
ylab = "Density",
freq = FALSE, xaxt = "n")
# custom x-axis with labels as hours and minutes
axis(1, at = hist_data$breaks,
labels = chron(times = hist_data$breaks, format = "h:mm"),
las = 2)
# adds density plot line in blue over histogram
lines(density(subset_data$Time), col = "blue", lwd = 2)
}
# creates 2x5 martix layout for plots
par(mfrow = c(2, 5))
# loops over years 1973:1982 and calls function above to
# generate histogram and density plot
for (year in 1973:1982) {
plot_histogram_density(year, cleaned_data)
}
There were only two participants in 1973 with age and time recorded.
- Plot the density curves for all ten years in one figure, along with a density curve of the combined data set (for a total of 11 curves). The main focus should be displaying the combined density, but all 11 densities should be discernible.
density_data <- density(cleaned_data$Time)
years <- unique(cleaned_data$Year)
line_colors <- c("red", "#ffd92f", "orange", "firebrick", "magenta",
"purple", "blue", "cornflowerblue", "darkgreen", "lightgreen")
custom_ticks <- c("00:42", "01:13", "01:45", "02:17", "02:49")
tick_positions <- seq(min(density_data$x), max(density_data$x),
length.out = length(custom_ticks))
tick_labels <- as.numeric(tick_positions)
p <- plot_ly() %>%
add_trace(type = "scatter",
mode = "lines",
x = density_data$x,
y = density_data$y,
line = list(color = "black", width = 5),
name = "All Years") %>%
layout(title = "Density Curve by Year",
xaxis = list(title = "Time (hr:min)", showline = TRUE,
tickvals = tick_positions, ticktext = custom_ticks),
yaxis = list(title = "Density", showline = TRUE),
showlegend = TRUE)
for (i in seq_along(years)) {
year <- years[i]
subset_data <- cleaned_data[cleaned_data$Year == year, ]
p <- add_trace(p,
type = "scatter",
mode = "lines",
x = density(subset_data$Time)$x,
y = density(subset_data$Time)$y,
line = list(color = line_colors[i], width = 3,
dash = ifelse(year %% 2 == 0, "solid", "dash")),
name = as.character(year))
}
p
The above plot is interactive. Click the lines in the legend to remove or add a line.
Question 2
- Correlating age and time: Create a scatter plot of age and race time, with time being the response. All ten year’s worth of data should be included, but you should be able to tell which year each point comes from. Include trend lines for each year, along with a trend line for the combined data set.
scat_plot = function(curr_year,cleaned_data)
{
sub_data = cleaned_data %>% filter(Year == curr_year)
plot(sub_data$Age ~ sub_data$Time,
main = curr_year,
xlab = "Time",
ylab = "Age")
abline(lm(sub_data$Age ~ sub_data$Time), col="red")
}
par(mfrow = c(2, 5))
for (curr_year in 1973:1982)
{
scat_plot(curr_year, cleaned_data)
}
Write a short interpretation of the plot as if you were explaining it to your client.
Looking at the red line showing the general linear trend of the runners we can see that as the ages of runners increased so did the time.
Question 3
- Relating age and times categorically:
We’re interested in the age composition for ten performance groups. The performance groups are defined based ten percentiles (10%, 20%,…100%) of relative finish position. For example, someone finishing 3rd out of 125 would be in the 10th-percentile group, while someone finishing 985 out of 1013 would be in the 100th-percentile group.
The age groups we’re interested in are defined by decade, so separate people in their 20’s from people in their 30’s and so forth.
Generate one plot that displays the age composition of each of the ten performance groups. Make sure you’re using all ten year’s worth of data.
Hint: You can compute performance groups manually fromYearandTime, or by carefully manipulatingPis/Tis.